Prerequisite

Load the required packages into your environment with the following code (add more packages to the list as needed):

knitr::opts_chunk$set(echo = TRUE)

library(pacman)
p_load(tidyverse, foreign, corrplot, stargazer, coefplot, effects, psych, ggcorrplot)

Part 1: The Replication Project

1. Read the paper closely and respond to the following questions:

  1. What are the authors’ research questions?
  • RQ: To what extent has the decrease in the racial pay gap over time been influenced by the different economic sources and trajectories of men and women?
  2. What is the gap in the literature that the authors aim to fill? How does their analysis advance the literature?
  • The authors focus on how race and gender intersect to produce wage gaps. They propose a new theoretical framework that aims to establish reasonable expectations for wage disparities across race and gender groups. Importantly, the framework is designed to account for changes in the racial pay gap over time.
  3. What is the population that they are making inferences about? Be specific and make sure to identify the geographical region that they are focusing on, the time period, demographic characteristics, and so on.
  • The population is all black and white workers aged 25 to 59 in the United States, observed at ten-year intervals from 1970 to 2010.
  4. Do they have data on all individuals in the population? If they don’t, how do they solve this?
  • No, they do not have data on all individuals in the population, so they rely on sample data collected by surveys. Specifically, they use a 5% census sample for 1980 and 2000, a 1% census sample for 1970 and 1990, and American Community Survey data for 2010.
  5. Did the authors collect the data themselves? If they did, describe their sampling procedure. If they didn’t, identify and describe the data source and discuss the sampling procedure that was used to collect this data.
  • The authors used secondary data obtained from IPUMS. They compiled their data set from several sources: 1% and 5% US census samples as well as data from the American Community Survey (ACS).

2. As you read the text, make a list of all variables or characteristics of the population that the authors mention throughout the paper (e.g., gender, age, wages, occupation, …). Submit a list with all the variables and characteristics that you have identified in the text.

Note: Make sure to read footnotes and table notes; they contain important information to understand who is in the sample.

  • gender (sex)
  • race
  • age
  • earnings
  • level of education
  • potential work experience (age - years of schooling - 6) and its square
  • weekly working hours
  • weekly working hours logged
  • weekly wage
  • weekly wage logged
  • marital status
  • nativity status
  • number of children
  • presence of a child under 5 (=1)
  • sector, working in public service (=1)
  • region
  • metropolitan area (=1)
  • occupation (1990 codes)

3. Based on your answers to prior questions, select the samples and variables that you think you will need to replicate the paper in your IPUMS account. Submit a screenshot of the page where you can see the samples and variables that you have selected.

Note: you can obtain this from your “Data Cart”.

Part 2: Regression

Import the dataset sat_math.dta to your R environment and examine the effect of IQ and other variables on SAT math score. Hint: use read.dta()

# loading data into environment
sat_data <- read.dta("sat_math.dta") 

Variable Name    Variable Detail
sat_math         SAT Math Score
female           The Female Dummy (Male = 0)
black/other      Two Racial Dummies (White as the Reference Group)
meduy            Mother’s Years of Schooling
feduy            Father’s Years of Schooling
hours            Average Weekly Study Hours
IQ               IQ Score (0 to 100)

1. Report descriptive statistics:

  1. Create a table that reports descriptive statistics (you should at least report the means) of all the variables grouped by gender
# Group the data by gender and report the mean of every variable
# (na.rm = TRUE keeps a single missing value from turning a mean into NA)
sat_data %>%
  group_by(female) %>%
  summarise(across(c(sat_math, black, other, meduy, feduy, hours, IQ),
                   ~ mean(.x, na.rm = TRUE), .names = "mean_{.col}"))
  2. Create a correlation matrix and display it.
## Set use = "complete.obs" to ignore observations with NAs
M <- cor(sat_data, use = "complete.obs")

# Pass the correlation matrix directly to `ggcorrplot` to visualize it
ggcorrplot(M,
           hc.order = TRUE,
           type = "lower",
           lab = TRUE)

2. Create scatter plots:

  • Besides the key dependent variable (DV) sat_math, choose one numeric independent variable (IV) that seems to have a meaningful relation to the DV based on the correlation matrix you created, and then create the following plots:
  (a) A scatter plot of the DV and IV
  (b) A scatter plot of the DV and IV with a fitted linear regression line
  (c) A scatter plot of the DV and IV, with each observation color coded by gender
  (d) On top of plot (c), fit a linear regression line for each gender group; the lines should also be color coded
#   (a) A scatter plot of the DV and IV  
sat_data %>%
  ggplot(aes(x = hours, y = sat_math)) +
  geom_point(shape = 1, alpha = 0.7) +
  labs(title = "Relationship Between Average Weekly Study Hours and SAT Math Score",
       x = "Average Weekly Study Hours",
       y = "SAT Math Score")

#  (b) A scatter plot of the DV and IV with a fitted linear regression line  
sat_data %>%
  ggplot(aes(x = hours, y = sat_math)) +
  geom_point(shape = 1, alpha = 0.7) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship Between Average Weekly Study Hours and SAT Math Score",
       x = "Average Weekly Study Hours",
       y = "SAT Math Score")
## `geom_smooth()` using formula 'y ~ x'

# (c) A scatter plot of the DV and IV, and each observation is color coded by gender       
# Create new variable 'gender'
sat_data <- sat_data %>%
  mutate(gender = ifelse(female == 1, "female", "male"))

sat_data %>% as_tibble() %>% ggplot(aes(x = hours, y = sat_math, color = gender)) +
  geom_point(shape = 1) +
  labs(title = "Relationship Between Average Weekly Study Hours and SAT Math Score",
       x = "Average Weekly Study Hours",
       y = "SAT Math Score",
       subtitle = "Grouped by gender")

# (d) On top of plot (c), fit a linear regression line for each gender group, the lines should also be color coded
sat_data %>% as_tibble() %>%
  ggplot(aes(x = hours, y = sat_math, color = gender)) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "Relationship Between Average Weekly Study Hours and SAT Math Score",
       x = "Average Weekly Study Hours",
       y = "SAT Math Score",
       subtitle = "Grouped by gender")
## `geom_smooth()` using formula 'y ~ x'

3. Additional exploratory data analysis:

  1. What are your preliminary findings/reflections on the data based on the descriptive statistics, the correlation matrix, and the scatter plots?

  • Based on the descriptive statistics, students overall spend an average of 39 hours studying per week, and males and females spend roughly the same amount of time studying each week. On average, females score 47 points higher than males. In terms of IQ, males on average score 2 points higher. The data are also somewhat skewed to the right (median < mean), so a relatively small number of high SAT math scores pulls the mean above the median, unlike in a normal distribution. Based on the correlation plot, IQ has the strongest positive correlation with SAT math scores (0.65). Parents’ years of schooling are also strongly and positively correlated with SAT math scores, more so than the remaining explanatory variables. For my variable of interest (hours), there appears to be a very minor negative relationship between average hours studied per week and SAT math scores (-0.03). This weak relationship is confirmed by the scatter plots, especially those with regression lines, which show that while females tend to score higher, average weekly hours studied is not strongly correlated with SAT math scores.

  2. What other exploratory data analysis will be useful for you to better understand the data before modeling? Please implement some additional exploratory data analysis and discuss your preliminary findings.

  • Other useful exploratory analyses before modeling include plotting SAT math scores against each of the other predictor variables to understand their relationships visually. The data can also be grouped by race to examine racial differences in SAT math scores at different levels of the predictors.
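
The checks described in the answer above could be implemented along the following lines. This is a minimal sketch, not the assignment's solution: because `sat_math.dta` is not reproduced here, the block first simulates a stand-in data frame (`sim`) whose values and coefficients are purely illustrative assumptions. With the real data, only the `mutate()`, summary, and plotting steps would apply, run on `sat_data` instead.

```r
library(dplyr)
library(ggplot2)

set.seed(123)
n <- 500

# Stand-in data (hypothetical values, NOT the real sat_math.dta)
grp <- sample(c("white", "black", "other"), n, replace = TRUE,
              prob = c(0.60, 0.25, 0.15))
sim <- data.frame(
  IQ     = round(runif(n, 20, 80)),
  female = rbinom(n, 1, 0.5),
  black  = as.integer(grp == "black"),
  other  = as.integer(grp == "other")
)
sim$sat_math <- 300 + 4 * sim$IQ + 50 * sim$female + rnorm(n, sd = 50)

# Collapse the two race dummies into one labelled variable
sim <- sim %>%
  mutate(race = case_when(black == 1 ~ "black",
                          other == 1 ~ "other",
                          TRUE       ~ "white"))

# Count, mean, median, and SD of SAT math score by race
race_summary <- sim %>%
  group_by(race) %>%
  summarise(count      = n(),
            mean_sat   = mean(sat_math),
            median_sat = median(sat_math),
            sd_sat     = sd(sat_math),
            .groups    = "drop")
print(race_summary)

# Distribution of SAT math score by race
p_box <- ggplot(sim, aes(x = race, y = sat_math)) +
  geom_boxplot() +
  labs(title = "SAT Math Score by Race", x = "Race", y = "SAT Math Score")

# IQ against SAT math score, one panel per race
p_facet <- ggplot(sim, aes(x = IQ, y = sat_math)) +
  geom_point(shape = 1, alpha = 0.5) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  facet_wrap(~ race) +
  labs(title = "IQ and SAT Math Score by Race", x = "IQ", y = "SAT Math Score")

print(p_box)
print(p_facet)
```

Comparing the mean and median within each group gives a quick check on skew, and the faceted plot shows whether the IQ slope looks similar across racial groups before any model is fitted.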

4. Nested models:

  • Build five nested models that use sat_math as the DV and report regression results in a table using stargazer() from the stargazer package.
  1. Model 1: Baseline (only add “IQ” as the independent variable)
  2. Model 2: Model 1 + Demographic Characteristics
  3. Model 3: Model 2 + Parental Education
  4. Model 4: Model 3 + Weekly Study Hours
  5. Model 5: Model 4 + An Interaction Between IQ and the Female Dummy
# Creating the five models

m1 <- lm(sat_math ~ IQ, data = sat_data)
m2 <- lm(sat_math ~ IQ + female + black + other, data = sat_data)
m3 <- lm(sat_math ~ IQ + female + black + other + feduy + meduy, data = sat_data)
m4 <- lm(sat_math ~ IQ + female + black + other + feduy + meduy + hours, data = sat_data)
m5 <- lm(sat_math ~ IQ + female + black + other + feduy + meduy + hours + IQ:female, data = sat_data)

stargazer(m1, m2, m3, m4, m5, type = "text")
## 
## ================================================================================================================================================
##                                                                         Dependent variable:                                                     
##                     ----------------------------------------------------------------------------------------------------------------------------
##                                                                               sat_math                                                          
##                               (1)                      (2)                      (3)                      (4)                      (5)           
## ------------------------------------------------------------------------------------------------------------------------------------------------
## IQ                          4.211***                 4.356***                 3.487***                 3.484***                 2.893***        
##                             (0.154)                  (0.138)                  (0.132)                  (0.132)                  (0.175)         
##                                                                                                                                                 
## female                                              53.831***                51.693***                51.597***                  -9.552         
##                                                      (3.565)                  (3.142)                  (3.143)                  (12.357)        
##                                                                                                                                                 
## black                                               -16.130***               -14.455***               -14.519***               -15.324***       
##                                                      (4.432)                  (3.906)                  (3.906)                  (3.861)         
##                                                                                                                                                 
## other                                                -10.809*                  -6.918                   -6.825                   -6.152         
##                                                      (6.005)                  (5.294)                  (5.295)                  (5.231)         
##                                                                                                                                                 
## feduy                                                                         5.725***                 5.747***                 5.703***        
##                                                                               (0.474)                  (0.475)                  (0.469)         
##                                                                                                                                                 
## meduy                                                                         6.434***                 6.434***                 6.379***        
##                                                                               (0.491)                  (0.491)                  (0.485)         
##                                                                                                                                                 
## hours                                                                                                   -0.264                   -0.255         
##                                                                                                        (0.251)                  (0.248)         
##                                                                                                                                                 
## IQ:female                                                                                                                       1.232***        
##                                                                                                                                 (0.241)         
##                                                                                                                                                 
## Constant                   315.605***               286.110***               184.937***               195.423***               226.197***       
##                             (7.906)                  (7.519)                  (8.921)                  (13.391)                 (14.530)        
##                                                                                                                                                 
## ------------------------------------------------------------------------------------------------------------------------------------------------
## Observations                 1,000                    1,000                    1,000                    1,000                    1,000          
## R2                           0.428                    0.543                    0.646                    0.647                    0.656          
## Adjusted R2                  0.427                    0.541                    0.644                    0.644                    0.653          
## Residual Std. Error    62.734 (df = 998)        56.156 (df = 995)        49.452 (df = 993)        49.450 (df = 992)        48.835 (df = 991)    
## F Statistic         746.637*** (df = 1; 998) 295.587*** (df = 4; 995) 302.435*** (df = 6; 993) 259.415*** (df = 7; 992) 236.007*** (df = 8; 991)
## ================================================================================================================================================
## Note:                                                                                                                *p<0.1; **p<0.05; ***p<0.01

5. For the result of Model 1:

  1. What are the hypotheses that you are testing in this model with your t-values in the (Intercept) and IQ row of the modeling results?
  2. Create a 95% confidence interval for the parameter \(\beta_{\text{IQ}}\) based on Model 1 result.
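
In a standard `lm()` summary, each coefficient row tests the null hypothesis that the corresponding parameter equals zero against a two-sided alternative, so the (Intercept) row tests \(H_0: \beta_0 = 0\) and the IQ row tests \(H_0: \beta_{\text{IQ}} = 0\). The sketch below works through the IQ test and the 95% confidence interval by hand, using only the rounded estimate, standard error, and residual degrees of freedom reported for Model 1 in the table above. With the fitted model in memory, `confint(m1, "IQ")` should give essentially the same interval from the unrounded values.

```r
# Estimate, standard error, and residual df for IQ, copied from Model 1
b_iq  <- 4.211
se_iq <- 0.154
df    <- 998

# t statistic for H0: beta_IQ = 0 against H1: beta_IQ != 0
t_iq <- b_iq / se_iq

# 95% CI: estimate +/- critical t value * standard error
t_crit <- qt(0.975, df)
ci_iq  <- b_iq + c(-1, 1) * t_crit * se_iq

round(t_iq, 2)   # about 27.34
round(ci_iq, 3)  # about 3.909 and 4.513
```

Because the interval excludes zero and the t statistic is far above the critical value, the null hypothesis for IQ is rejected at the 5% level.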

6. Interpret regression coefficients:

  1. How does the coefficient of “IQ” change across models? What could be the possible reason(s) for such changes?
  2. Interpret the coefficient of “black” in Model 4.
  3. Interpret the coefficient of “meduy” in Model 4.
  4. Interpret the coefficient of the interaction effect between IQ and the Female Dummy in Model 5.
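
For the interaction in item 4, it can help to turn the Model 5 coefficients into group-specific slopes. The sketch below uses only the rounded estimates from the regression table; the IQ levels at which the gap is evaluated are arbitrary illustrative values, not quantities from the data.

```r
# Coefficients copied from Model 5 in the regression table
b_iq        <- 2.893   # IQ slope when female = 0
b_female    <- -9.552  # female-male gap when IQ = 0
b_iq_female <- 1.232   # additional IQ slope when female = 1

# Group-specific IQ slopes implied by the interaction
slope_male   <- b_iq
slope_female <- b_iq + b_iq_female   # 2.893 + 1.232 = 4.125

# Predicted female-male gap at a few illustrative IQ levels,
# holding the other predictors fixed
iq_levels <- c(0, 25, 50, 75, 100)
gap <- b_female + b_iq_female * iq_levels

data.frame(IQ = iq_levels, female_minus_male = gap)
```

Read this way, each additional IQ point is associated with about 2.9 more SAT math points for males and about 4.1 for females, and the `female` coefficient on its own describes the gap only at IQ = 0, which is why it changes so much between Model 4 and Model 5.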

7. Create a coefficient plot for Model 5 with appropriate title and labels.

coefplot(m5, intercept = FALSE, innerCI = 1.96, outerCI = 1.96, color = "black",
         title = "Coefficient Plot of Model 5", xlab = "Estimate", ylab = "Predictor")

8. On the basis of Model 5, by holding other variables at their means, create a figure demonstrating the predicted SAT math score by gender and IQ levels (with confidence interval).
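
One possible approach to this figure is sketched below with base `predict()` and `ggplot2`. Since `sat_math.dta` is not available here, the block fits the Model 5 specification to a simulated stand-in data set; `sim`, `m5_sim`, and every coefficient used in the simulation are hypothetical. With the real data, the same steps should apply with `sat_data` and `m5` substituted in.

```r
library(ggplot2)

set.seed(42)
n <- 1000

# Stand-in data with hypothetical coefficients (NOT the real sat_math.dta)
grp <- sample(c("white", "black", "other"), n, replace = TRUE,
              prob = c(0.60, 0.25, 0.15))
sim <- data.frame(
  IQ     = round(runif(n, 20, 80)),
  female = rbinom(n, 1, 0.5),
  black  = as.integer(grp == "black"),
  other  = as.integer(grp == "other"),
  feduy  = round(runif(n, 8, 18)),
  meduy  = round(runif(n, 8, 18)),
  hours  = round(rnorm(n, 39, 6))
)
sim$sat_math <- with(sim, 220 + 3 * IQ - 10 * female + 1.2 * IQ * female -
                       15 * black - 6 * other + 6 * feduy + 6 * meduy +
                       rnorm(n, sd = 50))

# Same specification as Model 5
m5_sim <- lm(sat_math ~ IQ + female + black + other + feduy + meduy + hours +
               IQ:female, data = sim)

# Prediction grid: vary IQ and gender, hold all other predictors at their means
grid <- expand.grid(IQ = seq(min(sim$IQ), max(sim$IQ), by = 1),
                    female = c(0, 1))
for (v in c("black", "other", "feduy", "meduy", "hours")) {
  grid[[v]] <- mean(sim[[v]])
}

# Predicted means with 95% confidence intervals (columns fit, lwr, upr)
pred <- predict(m5_sim, newdata = grid, interval = "confidence", level = 0.95)
grid <- cbind(grid, pred)
grid$gender <- ifelse(grid$female == 1, "female", "male")

p_pred <- ggplot(grid, aes(x = IQ, y = fit, color = gender, fill = gender)) +
  geom_ribbon(aes(ymin = lwr, ymax = upr), alpha = 0.2, color = NA) +
  geom_line() +
  labs(title = "Predicted SAT Math Score by Gender and IQ",
       subtitle = "Other predictors held at their means; bands show 95% CIs",
       x = "IQ", y = "Predicted SAT Math Score")
print(p_pred)
```

Setting the dummy variables to their sample means yields predictions for an "average" respondent; predicting for a specific reference group (for example `black = 0, other = 0`) is an equally defensible choice and only shifts both lines by a constant.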

Part 3 (Bonus): Data Simulation

Simulation is a fun and effective way to learn about statistical inference. You will get a better understanding of how each population parameter affects the shape of the distribution.

Now that we have learned how to identify interactions in a given sample, you can try simulating a data set whose true data-generating process involves an interaction between two variables. For example, you can try to reproduce a scatter plot similar to the one we saw in class (the right panel) by simulating data whose variables have such associations:

Or you can try to reproduce a scatter plot that demonstrates Simpson’s Paradox:

Note: Your output does not need to replicate the exact layout of the example graphs. You will get extra credit as long as you generate a similar graph that clearly illustrates the relationship (either a positive or negative interaction, or Simpson’s Paradox). Remember to use set.seed() for any random process.
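
A minimal sketch of both simulations follows. All parameter values (sample sizes, slopes, group means, noise levels) are arbitrary assumptions chosen to make the patterns visible; they are not taken from the class examples. The first part builds a positive interaction directly into the data-generating process, and the second part produces Simpson's Paradox by giving each group a negative within-group slope while the group means rise together.

```r
library(ggplot2)
set.seed(2024)

# --- Interaction: the slope of x differs between two groups ---
n <- 400
inter <- data.frame(x = runif(n, 0, 10), group = rbinom(n, 1, 0.5))
# True process: slope of x is 1 when group = 0 and 1 + 3 = 4 when group = 1
inter$y <- 5 + 1 * inter$x + 2 * inter$group +
  3 * inter$x * inter$group + rnorm(n, sd = 4)
fit_inter <- lm(y ~ x * group, data = inter)

p_inter <- ggplot(inter, aes(x = x, y = y, color = factor(group))) +
  geom_point(shape = 1) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x) +
  labs(title = "Simulated Positive Interaction", color = "Group")

# --- Simpson's Paradox: negative slope within groups, positive slope overall ---
k <- 4     # number of groups
m <- 100   # observations per group
simp <- do.call(rbind, lapply(1:k, function(g) {
  x <- rnorm(m, mean = 3 * g, sd = 1)
  # Within group g the slope is -1, but group means rise with g
  y <- 5 * g - 1 * (x - 3 * g) + rnorm(m, sd = 1)
  data.frame(x = x, y = y, group = factor(g))
}))

pooled_slope  <- coef(lm(y ~ x, data = simp))["x"]
within_slopes <- sapply(split(simp, simp$group),
                        function(d) coef(lm(y ~ x, data = d))["x"])

p_simpson <- ggplot(simp, aes(x = x, y = y)) +
  geom_point(aes(color = group), shape = 1) +
  geom_smooth(aes(color = group), method = "lm", se = FALSE, formula = y ~ x) +
  geom_smooth(method = "lm", se = FALSE, formula = y ~ x,
              color = "black", linetype = "dashed") +
  labs(title = "Simulated Simpson's Paradox", color = "Group")

print(p_inter)
print(p_simpson)
```

In the second plot, the dashed pooled line slopes upward while every colored within-group line slopes downward, which is the reversal the paradox describes. Inspecting `coef(fit_inter)`, `pooled_slope`, and `within_slopes` confirms the patterns numerically.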